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Summary. We consider the problem of estimating multiple related Gaussian graphical models from a high-dimensional 
data set with observations belonging to distinct classes. We propose the joint graphical lasso, which borrows strength 
across the classes in order to estimate multiple graphical models that share certain characteristics, such as the loca- 
tions or weights of nonzero edges. Our approach is based upon maximizing a penalized log likelihood. We employ 
generalized fused lasso or group lasso penalties, and implement a fast ADMM algorithm to solve the corresponding 
convex optimization problems. The performance of the proposed method is illustrated through simulated and real 
data examples. 
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1. Introduction 



In recent years, much interest has focused upon estimating an undirected graphical model on the basis of a 
n x p data matrix X, where n is the number of observations and p is the number of features. Suppose that the 
observations xi, . . . , x„ £ R p are independent and identically distributed N(fi, E), where fi £ K p and S is a 
positive definite p x p matrix. Then zeros in the inverse covariance matrix E _1 correspond to pairs of features 
that are conditionally independent - that is, pairs of variables that are independent of each other, given all of the 
other variables in the data set. In a Gaussian graphical model ( Lauritzen] |l996 ) , these conditional dependence 
relationships are represented by a graph in which nodes represent features and edges connect conditionally 
dependent pairs of features. 

A natural way to estimate the precision (or concentration) matrix S _1 is via maximum likelihood. Letting 
S denote the empirical covariance matrix of X, the Gaussian log likelihood takes the form (up to a constant) 



(logdetS" 1 -trace(SS" 1 )) 



(1.1) 



Maximizing (1.1) with respect to S 1 yields the maximum likelihood estimate S 



However, two problems can arise in using this maximum likelihood approach to estimate S . First, in 
the high-dimensional setting where the number of features p is larger than the number of observations n, the 
empirical covariance matrix S is singular and so cannot be inverted to yield an estimate of E . If p « n, then 
even if S is not singular, the maximum likelihood estimate for E _1 will suffer from very high variance. Second, 
one often is interested in identifying pairs of variables that are unconnected in the graphical model, i.e. that 
are conditionally independent; these correspond to zeros in E . But maximizing the log likelihood (1.1) will 
in general yield an estimate of S -1 with no elements that are exactly equal to zero. 

In recent years, a number of proposals have been made for estimating E _1 in the high-dimensional setting 



in such a way that the resulting estimate is sparse. Meinshausen & Biihlmann ( 2006 ) proposed doing this via 



a penalized regression approach, which was extended by Peng et al. ( 2009 1 . A number of authors have instead 
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taken a penalized log likelihood approach ( Yuan k. Lin||20076 Friedman, Hastie & Tibshirani 2007 Rothman 
et al.|[2008 ): rather than maximizing (1.1), one can instead solve the problem 



maximize© {logdet© — trace(S0) — A||0||i} , 



(1.2) 



where A is a nonnegative tuning par ameter. The solu tion to this optimization problem provides an estimate 
for The use of an l\ or lasso (Tibshirani 1996) penalty on the elements of has the effect that when 



the tuning parameter A is large, some elements of the resulting precision matrix estimate will be exactly equal 



to zero. Moreover, (1.2) can be solved even if p ^> n. The solution to the problem (1.2) is referred to as the 



graphical lasso. Some authors have proposed applying the l\ penalty in (1.2) only to the off-diagonal elements 
of 0. 

Graphical models are especially of interest in the analysis of gene expression data, since it is believed that 
genes operate in pathways, or networks. Graphical models based on gene expression data can provide a useful 
tool for visualizing the relationships among genes and for generating biological hypotheses. The standard 
formulation for estimating a Gaussian graphical model assumes that each observation is drawn from the same 
distribution. However, in many datasets the observations may correspond to several distinct classes, so the 
assumption that all observations are drawn from the same distribution is inappropriate. For instance, suppose 
that a cancer researcher collects gene expression measurements for a set of cancer tissue samples and a set of 
normal tissue samples. In this case, one might want to estimate a graphical model for the cancer samples and a 
graphical model for the normal samples. One would expect the two graphical models to be similar to each other, 
since both are based upon the same type of tissue, but also to have important differences stemming from the 
fact that gene networks are often dysregulated in cancer. Estimating separate graphical models for the cancer 
and normal samples does not exploit the similarity between the true graphical models. And estimating a single 
graphical model for the cancer and normal samples ignores the fact that we do not expect the true graphical 
models to be identical, and that the differences between the graphical models may be of interest. 

In this paper, we propose the joint graphical lasso, a technique for jointly estimating multiple graphical 
models corresponding to distinct but related conditions, such as cancer and normal tissue. Our approach is 
an extension of the graphical lasso (1.2) to the case of multiple data sets. It is based upon a penalized log 



likelihood approach, where the choice of penalty depends on the characteristics of the graphical models that we 
expect to be shared across conditions. 

We illustrate our method with a small toy example that consists of observations from two classes. Within 
each class, the observations are independent and identically distributed according to a normal distribution. The 
two classes have distinct covariance matrices. When we apply the graphical lasso separately to the observations 
in each class, the resulting graphical model estimates are less accurate than when we use our joint graphical 
lasso approach. Results are shown in Figure [lj 



(a) 



lb] 



(c) 







Fig. 1. Comparison of the graphical lasso with our joint graphical lasso in a toy example with two conditions, p = 10 
variables, and n=200 observations per condition, (a): True networks, (b): Networks estimated by applying the graphical 
lasso separately to each class, (c); Networks estimated by applying our joint graphical lasso proposal. 
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The rest of this paper is organized as follows. In Section [2] we present the joint graphical lasso optimization 
problem. Section [3] contains an alternating directions method of multipliers algorithm for its solution. In Section 
4] we present theoretical results that lead to massive gains in the algorithm's computational efficiency. Section 
5] contains a discussion of related approaches from the literature, and in Section [6] we discuss tuning parameter 
selection. In Section [7J we illustrate the performance of our proposal in a simulation study. Section [8] contains 
an application to a lung cancer gene expression dataset. The discussion is in Section [9| 



2. The joint graphical lasso 

We briefly introduce some notation that will be used throughout this paper. 

We let K denote the number of classes in our data, and let XT 1 denote the true precision matrix for the 

_i _i . 

fcth class. We will seek to estimate Y, 1 , . . . , Ti K by formulating convex optimization problems with arguments 
{©} = 0W, . . . , ®( K >. The solutions 0W, . . . , ©( K > to these optimization problems will constitute estimates 
of ^\...,T,- K \ 

We will index matrix elements using i = 1, . . . ,p, j — 1, . . . ,p, and will index classes using k — 1, . . . , K. 
||A||ir will denote the Frobenius norm of matrix A, i.e. ||A||.p = \fJ2i J2j ■ 



2. 1. The general formulation for the joint graphical lasso 

Suppose that we are given K data sets, Y^, . . . , Y", with K > 2. Y( fc ) is a n k x p matrix consisting of n k 
observations with measurements on a set of p features, which are common to all K data sets. Furthermore, 
we assume that the 2fc=i n k observations are independent, and that the observations within each data set are 

(k) (k) 

identically distributed: ,...,y, lf . ~ iV(/x fc ,Sfe). Without loss of generality, we assume that the features 
within each data set are centered such that fi k — 0. We let S^ k ' = ^-(Y^) T Y^ k \ the empirical covariance 

matrix for . The log likelihood for the data takes the form (up to a constant) 



1 K 

£({0}) = 2^Z nfc (logdet©( fe ) -trace(S ( ' £) e (fc) ; 



fc=i 



(2.3) 



Maximizing ( 2.3 1 with respect to 0*- 1 ) , . . . , W yields the maximum likelihood estimate (S*- 1 ^ 



However, depending on the application, the maximum likelihood estimates that result from (2.3) may not 
be satisfactory. When p is smaller than but close to n k , the maximum likelihood estimate can have very 
high variance, and no elements of (S^ 1 )) -1 , . . . , (S^) -1 will be zero, leading to difficulties in interpretation. 
In addition, when p > n k , the maximum likelihood estimate becomes ill-defined. Moreover, if the K data 
sets correspond to observations collected from K distinct but related classes, then one might wish to borrow 
strength across the K classes to estimate the K precision matrices, rather than estimating each precision matrix 
separately. 

Therefore, instead of estimating TjT , ■ ■ ■ , S^- 1 by maximizing (2.3), we consider the penalized log likelihood 
and seek {0} solving 



maximize 



{&} 




(logdet0« - trace (s^©^) - P{{&}) \ 



(2.4) 



subject to the constraint that 0W, . . . , are positive definite. Here P({®}) denotes a convex penalty 



function, so that the objective in (2.4) is strictly concave in {©}. We propose to choose a penalty function P 
that will encourage 0^', . . . , ©^ to share certain characteristics, such as the locations or values of the nonzero 
elements; moreover, we would like the estimated precision matrices to be sparse. In particular, we will consider 
penalty functions that take the form P({0}) = P({0}) + Ai J2k ^i^j I' where P is a convex function and 



Ai is a nonnegative tuning pa rame ter. When P({©}) = 0, (2.4) amounts to performing K uncoupled graphical 
lasso optimization problems (1.2). The P penalty is chosen to encourage similarity across the K estimated 



precision matrices; therefore, we refer to the solution to (2.4) as the joint graphical lasso (JGL). We discuss 



specific forms of the penalty function in (2.4) in the next section 
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2.2. Two useful penalty functions 

In this subsection, we introduce two particular choices of the convex penalty function P in (2.4) that lead to 
useful graphical model estimates. In Appendix 1, we further extend these proposals to work on the scale of 
partial correlations. 



2.2.1. The fused graphical lasso 

The fused graphical lasso (FGL) is the solution to the problem (2.4) with the penalty 



K 



7 ij 



(2.5) 



fc=i i^j 



k<k' i,j 



where Ai and A2 are nonnegative tuning parameters. This is a generalized fused lasso penalty (Hocfling 20106), 
and results from applying £j penalties to (1) each off-diagonal element of the K precision matrices, and (2) 
differences between corresponding elements of each pair of precision matrices. Like the graphical lasso, FGL 
results in sparse estimates . . . , 8^ when the tuning parameter Ai is large. In a ddition, many element s 

of ©W, . . . , 0(*O will be identical across classes when the tuning parameter A2 is large ( jTibshirani et al. |2005| ). 
Thus FGL borrows information aggressively across classes, encouraging not only similar network structure but 
also similar edge values. 



2.2.2. The group graphical lasso 

We define the group graphical lasso (GGL) to be the solution to (2.4) with 



K 



fe=i i& 



E 



\ 



K 

E- 

k=l 



(2.6) 



Again, Ai and A2 are nonnegative tuning parameters. A lasso penalty is applied to the elements of the precision 



matrices and a group lasso penalty is applied to the (i,j) element across all K precision matrices (Yuan & Lin 



2007a). This group lasso penalty encourages a similar pattern of sparsity across all of the precision matrices - 
that is, there will be a tendency for the zeros in the K estimated precision matrices to occur in the same places. 
Specifically, when Ai = and A 2 > 0, each 0W will have an identical pattern of non-zero elements. On the 
other hand, the lasso penalty encourages further sparsity within each ®( k \ 

GGL encourages a weaker form of similarity across the K precision matrices than does FGL: the latter 
encourages shared edge values across the K matrices, whereas the former encourages only a shared pattern of 
sparsity. 



3. Algorithm for the joint graphical lasso problem 

3.1. An ADMM algorithm 

We solve the problem (2.4) using an alternating directions method of multipliers (ADMM) algorithm. We refer 
the reader to Boyd et al. (2010) for a thorough exposition of ADMM algorithms as well as their convergence 
properties, and to Simon & Tibshirani (2012) and Mohan et al. (20121 for recent applications of ADMM to 
related problems. 

To solve the problem (2.4) subject to the constraint that 0( fe ) is positive definite for k — 1, . . . , K using 
ADMM, we note that the problem can be rewritten as 

■^2 n k (logdete (fc) -trace(S (fe) (fe) )) + P({Z}) I , (3.7) 

subject to the positive-definiteness constraint as well as the constraint that = 0( fc ) for k = 1, . . . , K, where 



minimize 

{©b{Z} 



{Z} = {Z« 
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, Z( '}. The scaled augmented Lagrangian ( |Boyd et al.||2010[ ) for this problem is given by 



L p ({0},{Z},{U}) = - 



K 

E 

k=l 



nk 



(logdet©^ -trace(SW©^)) + P({Z}) 

^lieW-zW + uWul, 



(3.8) 



fc=i 



where {U} = {U' 1 ', . . . , U^} are dual variables. Roughly speaking, an ADMM algorithm corresponding to 
(3.8) results from iterating three simple steps. At the zth iteration, they are as follows: 

(a) {©(;)} <- argmin {e} {L p ({©}, {Z (i _ x) }, {U (i _ 1} })}. 

(b) {Z (i) } 4- argmin {z} {L p ({0 (l) }, {Z}, {U^)})}. 

(c) {U (<) } <- {U (i _ 1) } + ({0 W } - {Z (i) }). 

We now present the ADMM algorithm in greater detail. 

ADMM algorithm for solving the joint graphical lasso problem 

(a) Initialize the variables: ©W = I, IjW = 0, Z< fe ) = for k = 1, . . . , K. 

(b) Select a scalar p > 0. 

(c) For i — 1, 2, 3, . . . until convergence: 



(i) For fe = 1, .... K, update &^ ( k ) as the minimizer (with respect to &^) of 



-nk 



logdet0 (fc) -trace(S (fc) ( 



£||©(fc) 
2 M 



f(fe) 
J (*-i) 



U 



(AO 
(i-1) 



IF- 



Letting VDV T denote the eigendecomposition of — pz/fi^/rik + pXJ^^/rik, the solution is 



given (Witten & Tibshirani 20091 by VDV T , where D is the diagonal matrix with jth diagonal 

'-D 4 



element 



2p 



'33 



4p/n k 



(ii) Update {Z^\} as the minimizer (with respect to {Z}) of 



:£l|z«-(e[* 



^({z}). 



(3.9) 



(hi) For k = 1,. 



, K, update as u[^ 1} 



(0^-Z^). 

V (») (i) ; 



The final ©W, . . . , 8'^ that result from this algorithm are the JGL estimates of S^ 1 , . . . , S^ 1 . This algorithm 
is guaranteed to converge to the global optimum (Boyd et al. 20101. We note that the positive-definiteness 
constraint on the estimated precision matrices is naturally enforced by the update in Step (c)(i). 

Details of the minimization of (3.9 1 will depend on the form of the convex penalty function P. We note that 
the task of minimizing (3.9) can be re-written as 



K 



minimize 

{Z} 



|£||Z«-A«||| 
. fc=i 



^({Z}) 



where 



A« = 0< fc ? + Y 



(fc) 

(<-!)• 



(3.10) 



(3.11) 



We will see in Section 3.2 that for the FGL and GGL penalties, solving (3.10) is a simple task, regardless of the 
value of K. 

The algorithm given above involves computing the eigen decomposition of a p x p matrix, which can be 
computationally demanding when p is large. However, in Section |4j we will present two theorems that reveal 
that when the values of the tuning parameters Ai and A2 are large, one can obtain the exact solution to the JGL 
optimization problem without ever computing the eigen decomposition of a p x p matrix. Therefore, solving 
the JGL problem is fast even when p is quite large. In Section |8j we will see that one can perform FGL with 
K = 2 classes and almost 18,000 features in under 2 minutes. 
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3.2. Solving \3. for the joint graphic al lass o 

We now consider the problem of solving (3.10 1 if P is a generalized fused lasso or group lasso penalty. 



3.2.1. Solving \ 3.10) for FGL 



If P is the penalty given in (2.5 1 , then (3.101 takes the form 

K K 



minimize t £ || Z « - A«||» + A, £ £ \Z®\ + A 2 £ £ " 3 

' I fe=l k=l ijtj k<k' i,j 



(3.12) 



Now (3.12) is completely separable with respect to each pair of matrix elements that is, one can simply 

solve, for each 



™- { I h z % ] - 4 } ) 2 + Aii^ E 1^1 + a 2 E \ Z S ] - ^i} 

z ij , — > Z ij \ k=l k=l k<k' ) 



(3.13) 



This is a special case of the fused lasso signal approximator ( Hoefling|201Qp ) in which there is a fusion between 
each pair of variables. A very efficient algorithm for this special case, which can be performed in 0(K log K) 
operations, is available (Hocking et al. 2011, Hocfling 2010a Tibshirani 2012). 

In fact, when K = 2, (3.131 has a very simple closed form solution. When Ai = 0, it is easy to verify that 



the solution to (3.13) takes the form 



(4> 

id) 



A a /p, A 2) + A 2 /p) if A§> > Aff + 2\ 2 / P 



{Z ( i ]\z$) = { (A$ + \ 2 /p,A$-\ 2 /p) ifA%>A\? + 2X 2 /p 

A (1) +A (2) A (1) +A (2) cil i"51 

i^^, ^4^) if - 4 2) i < 2A 2 / P 



I (2) 



(3.14) 



And when Ai > 0, the solution to ( 3.13 ) can be obtained through soft-thresholding ( 3.14 ) by Xi/p (see Friedman^ 
Hastie, Hoefling fc Tibshirani|2007 ). Here the soft-thresholding operator is defined as S(x, c) = sgn(x)(|x| — c)+, 
where a + = max(a, 0). 



3.2.2. Solving |Oo| ) for GGL 



If P is the group lasso penalty (2.6), then (3.10) takes the form 

K K 



minimize t £ || Z « - A«|& + A, E E ^ + ^ E JE 3 



,-,2 



(3.15) 



k=l 



k=l 



First, for all i = 1, . . . ,p and fc = 1, . . . , K, it is easy to see that the solution to (|3.15|) has = A^' . And 
one can show that the off-diagonal elements take the form ( Friedman et al.||2010 ) 



>( fe ) _ Of A( k ) 



S(A^>,Xi/p) 1- 



A 2 



(3.16) 



where 5 denotes the soft-thresholding operator. 



4. Faster computations for FGL and GGL 

We now present two theorems that lead to substantial computational improvements to the JGL algorithm 
presented in Section [3] Using these theorems, one can inspect the empirical covariance matrices . . . , 
in order to determine whether the solution to the JGL optimization problem is block diagonal after some 
permutation of the features. Then one can simply perform the JGL algorithm on the features within each 
block separately, in order to obtain exactly the same solution that would have been obtained by applying the 
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algorithm to all p features. This leads to huge speed improvements since it obviates the need to ever compute 
the eigen decomposition of a p x p matrix. Our results mirror recent improvements in algorithms for solving 
the graphical lasso problem ( Witten et al.poTT Mazumder & Hastic 2012). 



For instance, suppose that for a given choice of Ai and A2, we determine that the estimated inverse co- 
variance matrices . . . , ®( K > are block diagonal, each with the same R blocks, the rth of which contains 
p r features, X^r=i Pr = V- Then in each iteration of the JGL algorithm, rather than having to compute the 
eigen decomposition of K p x p matrices, we need only compute eigen decompositions of matrices of dimension 
Pi xpi,... ,pr x pn. This leads to a potentially massive reduction in computational complexity from 0(p 3 ) to 

E*iO(p?). 



We begin with a very simple lemma for which the proof follows by inspection of (2.4 1. The lemma can be 
extended by induction to any number of blocks. 

Lemma 4.1. Suppose that the solution to the FGL or GGL optimization problem is block diagonal with known 
blocks. That is, the features can be reordered in such a way that each estimated inverse covariance matrix takes 
the form 



©(*) = ^ ° 
\ ©i fe) 



(4.17) 



where each of&^\ . . . , ©^ has the same dimension. Then, ©^ , . . . , ©1 and 2 , • ■ ■ , ©2 can be obtained 
by solving the FGL or GGL optimization problem on just the corresponding set of features. 

We now present the key results. Theorems[l]and[2]outline necessary and sufficient conditions for the presence 
of block diagonal structure in the FGL and GGL optimization problems, and are proven in Appendix 2. 

Theorem 1. Consider the FGL optimization problem with K = 2 classes. Let C\ and C2 be a non-overlapping 
partition of the p variables such that C\ D C2 — 0, C\ U C2 — {1, ■ ■ • ,p}- The following conditions are necessary 
and sufficient for the variables in C\ to be completely disconnected from those in C2 in each of the resulting 
network estimates: 

(a) KSJ ! < Ai + A 2 for all i e d and j e C 2 , 

(2) 

(b) \n2S\j \ < \i + A 2 for all i £ Ci and j <E C2, and 

(c) ImSff + n 2 S\f\ < 2Ai for all 1 e C x and jeC 2 . 



Furthermore, if K > 2, then 



\n k S^\ < Ai for all i e C u j E C 2 , k = l,...,K (4.18) 



is a sufficient condition for the variables in C\ to be completely disconnected from those in C 2 . 

Theorem 2. Consider the GGL optimization problem with K > 2 classes. Let C\ and C 2 be a non-overlapping 
partition of the p variables, such that C\ PI C 2 — 0, C\ U C 2 = {1, . . . ,p}. The following condition is necessary 
and sufficient for the variables in C\ to be completely disconnected from those in C 2 in each of the resulting 
network estimates: 



K 

£( 

k=l 



n k S\ k M - Aift < X 2 2 for all i € C u j € C 2 . (4.19) 



Theorems [T] and [2] allow us to quickly check if, given a partition of the features C\ and C 2 , the solution 
to the JGL optimization problem is block diagonal with one block corresponding to features in C\ and one 
block corresponding to features in C 2 . In practice, for any given (Ai, A 2 ), we can quickly perform the following 
two-step procedure to identify any block structure in the FGL or GGL solution: 

(a) Create M, a,pxp matrix with Ma = 1 for i = 1, . . . ,p. For i 7^ j, let Mij = if the conditions specified in 
Theorem [T] are met for that pair of variables and the FGL penalty is used, or if the condition of Theorem 
[2] is met for that pair of variables and the GGL penalty is used. Otherwise, set My = 1. 

(b) Identify the connected components of the undirected graph whose adjacency matrix is given by M. Note 
that this can be performed in 0(|A/|) operations, where \M\ is the number of nonzero elements in M 
( |Tm^[T972| ). 



8 Danaher et al. 



Theorems [T] and [2] guarantee that the connected components identified in Step (b) correspond to distinct blocks 
in the FGL or GGL solutions. Therefore, one can quickly obtain these solutions by solving the FGL or GGL 
optimization problems on the submatrices of these K p x p empirical covariance matrices that correspond to 
that block diagonal structure. Consequently, we can obtain the exact solution to the JGL optimization problem 
on extremely high-dimensional data sets that would otherwise be computationally intractable. For instance, in 
Section [8] we performed FGL on a gene expression data set with almost 18, 000 features in under two minutes. 

As pointed out by a reviewer, Theorems [l] and [2] lead to speed improvements only if the tuning parameters 
Ai and A2 are sufficiently large. We argue that this will in fact be the case in most practical applications of 
JGL. When network estimation is performed for the sake of data exploration and when p is large, only a very 
sparse network estimate will be useful; otherwise, interpretation of the estimate will be impossible. Even when 
data exploration is not the end goal of the analysis, large values of Ai and A2 will generally be used, since most 
data sets cannot reasonably support estimation of Kp(p + l)/2 nonzero parameters when nCp. 



5. Relationship to previous proposals 



Several past proposals have been made to jointly estimate graphical models on the basis of observations drawn 
from distinct conditions. Some proposals have used time-series data to define time-varying networks in the 
context of continuous or binary data JZhou et al.||2008| jSong et al.||2009a| |Ahmed fc Xmg||2009[ |Kolar fc Xing] 



2009 Song et al. 20096 Kolar et al. 2010). Guo et al. (2011) instead describe a likelihood-based method for 



estimating precision matrices across multiple related classes simultaneously. They employ a hierarchical penalty 
that forces similar patterns of sparsity across classes, an approach that is similar in spirit to GGL. 

Our FGL and GGL proposals have a number of advantages over these existing approaches. Methods for 
estimating time-varying networks cannot be easily extended to the setting where the classes lack a natural 
ordering. Guo et al. (2011 )'s proposal is a closer precursor to our method, and can in fact be stated as an 



instance of the problem (2.4) with a hierarchical group lasso penalty 



a(*)i 

ij I 



(5.20) 



that encourages a shared pattern of sparsity across the K classes. But the approach of Guo et al. (2011) has 



a number of disadvantages relative to FGL and GGL. (1) The penalty (5.20) is not convex, so convergence to 
the wrong local maximum is possible. (2) Because (5.20) is not convex, it is not possible to achieve the speed 
improvements described in Section 4 Consequently, the Guo et al. (2011) proposal is quite slow relative to our 
approach, as seen in Figures [2je) , 4 'e) , and [5je) , and essentially cannot be applied to very high-dimensional 
data sets. (3) Unlike FGL and GGL, it uses just one tuning parameter, and is unable to control separately the 
sparsity level and the extent of network similarity. (4) In cases where we expect edge values as well as network 
structure to be similar between classes, FGL is much better suited than GGL andlGuo et al. (2011 )'s proposal, 



both of which encourage shared patterns of sparsity but ignore the signs and values of the nonzero edges. 
Guo et al. (2011 1's proposal is included in the simulation study in Section [7] 



6. Tuning parameter selection 

One can select tuning parameters for JGL using an approximation of the Akaike Information Criterion (AIC), 



K 

AIC(\r, A 2 ) = £ [n fe trace(sW©g A2 ) - n k logdet 8*^ + 2E k 



fc=i 



(6.21) 



[* } is the set of estimated inverse covariance matrices based on tuning parameters Ai and A 2 , 



and 



where {0 

Ek is the number of non-zero elements in &\ K ^ ^ . A grid search can then be performed to select Ai and A2 
minimizing the AIC(\i, A2) score. The simulation study in Section 7 suggests that this criterion tends to select 
models whose Kullback-Leibler (dKL) divergence from the true model is low. When the number of variables p 
is very large, computing AIC(\\, A2) over a range of values for Ai and A2 may prove computationally onerous. 
If this is the case, we suggest a dense search over Ai followed by a quick search over A 2 . 

It is worth noting that in most cases, network estimation is performed as a part of exploratory data analysis 
and hypothesis generation. For these purposes, approaches such as AIC, BIC, and cross-validation may tend to 
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choose models too large to be useful. In this setting, model selection should be guided by practical considera- 
tions, such as network interpretability, stability, and the desire for an edge set with a low false discovery rate 
(|Meinshausen fc Buhlmann||2010| |Li et al.||2011[). 



7. Simulation study 



We compare the performances of FGL and GGL to two existing methods, graphical lasso and Guo et al. (2011 1's 



proposal, in Section 7.1 When applying the graphical lasso, networks arc fitted for each class separately. We 



investigate the effects of n and p on FGL and GGL's performances in Section 7.2 Additional simulation results 
are presented in Appendix 3. 

The effects of the FGL and GGL penalties vary with the sample size. For ease of presentation of the 
simulation study results, we multiply the reported tuning parameters Ai and A2 by the sample size of each class 
before performing JGL. 

To ease interpretation, we reparametrize the GGL penalties in our simulation study. The motivation is to 
summarize the regularization for "sparsity" and for "similarity" separately. In FGL, this is nicely achieved by 
just using Ai and A2, as the former drives network sparsity and the latter drives network similarity. In contrast, 
in GGL, both tuning parameters contribute to sparsity: Ai drives individual network edges to zero whereas A2 
simultaneously drives network edges to zero across all K network estimates. We reparameterize our simulation 
results for GGL in terms of u)\ = Ai + -^^2 and tu-2 = + ^7^2), which we found to reasonably reflect 

the levels of "sparsity" and "similarity" regularization, respectively. 



7. 1 . Performance as a function of tuning parameters 

7.1.1. Simulation set-up 

In this simulation, we consider a three-class problem. We first generate three networks with p = 500 features 
belonging to ten equally sized unconnected subnetworks, each with a power law degree distribution. Power law 



degree distributions are thought to mimic the structure of biological networks ( Chen & Sharp 2004 1 and are 
generally harder to estimate than simpler structures ( Peng et al.|20 09). Of the ten subnetworks, eight have the 
same structure and edge values in all three classes, one is identical between the first two classes and missing in 
the third (i.e. the corresponding features are singletons in the third network), and one is present in only the 
first class. The topology of the networks generated is shown in Figure [6] in Appendix 4. 

Given a network structure, we generate a covariance matrix for the first class as follows ( Peng et al.| [2009). 
We create a p x p matrix with ones on the diagonal, zeroes on elements not corresponding to network edges, and 
values from a uniform distribution with support on {[—.4, —.1] U [.1, .4]} on elements corresponding to edges. 
To ensure positive definiteness, we divide each off-diagonal element by 1.5 times the sum of the absolute values 
of off-diagonal elements in its row. Finally, we average the matrix with its transpose, achieving a symmetric, 
positive-definite matrix A. We then create the element of Si as 



dij(A )y/y(A r )ii(A 

where dij — 0.6 if i ^ j and dij = 1 if i = j. We create S 2 equal to Si, then reset one of its ten subnetwork 
blocks to the identity. We create S3 equal to S2, and reset an additional subnetwork block to the identity. 
Finally, for each class we generate independent, identically distributed samples from a N(0, S&) distribution. 

We present two additional simulations studies involving two-class datasets in Appendix 3. The first additional 
simulation uses the same network structure described above, and the second uses a single power law network 
with no block structure. 



7.1.2. Simulation results 

Our first set of simulations illustrates the effect of varying tuning parameters on the performances of FGL and 
GGL. We generated 100 three-class data sets with p — 500 features and n = 150 observations per class, as 
described in Section f7.1.1| Class l's network had 490 edges, class 2's network is missing 49 of those edges, and 
class 3's network is missing an additional 49 edges. Figure [2] shows the results, averaged over the 100 data sets. 
In each plot, the lines for FGL and for GGL indicate the results obtained with a single value of the similarity 
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The graphical lasso and the proposal of Guo et al. (2011 1 are included in the 



tuning parameters A2 and W2- 
comparisons. 

Figure [2ja) displays the number of true edges selected against the number of false edges selected. As the 
sparsity tuning parameters Ai and lo\ decrease, the number of edges selected increases. At many values of the 
similarity tuning parameter A2, FGL dominates the other methods. At some choices of the similarity tuning 
parameter W2, GGL performs as well as Guo et al. ( 2011[ ). FGL, GGL, and Guo et al. (2011 )'s proposal dominate 
the graphical lasso. 

Figure [2jb) displays the sum of squared errors (SS E) between estim ated edge values and true edge values: 



(S, )jj) 2 . Unlike the proposal of 



Guo et al. 



(2011), FGL, GGL, and the graphical lasso 



tend to overshrink edge values towards zero due to the use ot convex penalty functions. Thus, while FGL and 



GGL attain SSE values that are as low as those of Guo et al. (2011), they do so when estimating much larger 



networks. When simultaneous edge selection and estimation arc desired, it may be useful to run FGL or GGL 



once and then to re-run them with smaller penalties on the selected edges, as in Meinshausen (|2007 1 

Figure ^ 
classes 



§W ^ 9^ '. Since GGL, the proposal of 



cj evaluates each method's success in detecting differential edges, or edges that differ between 
For FGL, the number of differe ntial edges i s com puted as the number of pairs k < k' ,i < j such that 



«(*') 
n 



Guo et al. 



(2011), and the graphical lasso cannot yield edges that are 
exactly identical across classes, for those approaches the number of differential edges is computed as the number 
of pairs k < k',i < j such that |0y — 9^ \ > 10~ 2 . The number of true positive differential edges is plotted 
against the number of false positive differential edges. Note that by controlling the total number of non-zero 
edges, the sparsity tuning parameters Ai and u>i have a large effect on the number of edges that are estimated to 
differ between the two networks. FGL yields fewer false positives than the competing methods, since it shrinks 



between-class differences in edge values to zero. Since neither GGL nor Guo et al. (2011 )'s method are designed 



to shrink edge values towards each other, by this measure neither method outperforms even the graphical lasso. 
Figure |2jd) displays the sum of the dKL's of the estimated distributions from the true distributions, as a 



function of the l\ norm of the off-diagonal elements of the estimated precision matrices, i.e. ^2 k Ylutj l^ij I- The 
dKL from the multivariate normal model with inverse covariance estimates . . . , 0^) to the multivariate 

normal model with the true precision matrices S] -1 , . 



J K 



1 K 

- (" logdet(0«£ fc ) + trace(0«£ fc ) 



k=l 



At most values of A2, FGL attains a lower dKL than the other methods, followed by Guo et al. (2011 )'s method, 



then by GGL. The graphical lasso has the worst performance, since it estimates each network separately. 

Figure [2je) compares the methods' running times. Computation time (in seconds) is plotted against the 
total number of non-zero edges estimated. The graphical lasso is fastest, but FGL and GGL are much faster 
than the proposal of Guo et al. (2011 ), due to the results from Section |4j Timing comparisons were performed 
on an Intel Xeon x5680 3.3 GHz processor. It is worth mentioning that the FGL algorithm is much faster in 
problems with only two classes, since in that case there is a closed-form solution to the generalized fused lasso 
problem (Section |3.2| . This can be seen in Figures |l]je) and[5]je) in Appendix 3. 

We examined the FGL and GGL models with tuning parameters selected as described in Section [6] For 
FGL, AIC selected the tuning parameters Ai = 0.175, A2 = 0.025. Over the 100 replicate datasets, the FGL 
models with these tuning parameters averaged a dKL of 774, 884 true positive edges, 2406 false positive edges, 
77 true positive differential edges, and 4977 false positive differential edges. In GGL, AIC selected the tuning 
parameters wj = 0.225, 0J2 = 1. Over the 100 replicate datasets, the GGL models with these tuning parameters 
averaged a dKL of 776, 898 true positive edges, 736 false positive edges, 53 true positive differential edges, and 
1456 false positive differential edges. 



7.2. Performance as a function ofn andp 

We now evaluate the effect of sample size n and dimension p on the performances of FGL and GGL. 



7.2.1. Simulation set-up 

We generate a pair of networks with p = 500 much as described in Section 7.1.1 but with K = 2 instead of 
K = 3. The first network has 10 equal-sized components with power law degree distributions, and the second 
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Fig. 2. Performance of FGL, GGL, Guo et al. (2011)'s method, and the graphical lasso on simulated data with 150 
observations in each of 3 classes, and 500 features. Black lines display models derived using the graphical lasso, green 
lines display the proposal of Guo et al. (2011), red lines display FGL, and blue lines display GGL. (a): The number 
of edges correctly identified to be nonzero (TP Edges) is plotted against the number of edges incorrectly identified to be 
nonzero (FP edges), (b); The sum of squared errors in edge values is plotted against the total number of edges estimated 
to be nonzero, (c): The number of edges correctly found to have values differing between classes (TP Differential Edges) 
is plotted against the number of edges incorrectly found to have values differing between classes (FP Differential Edges). 
(d): The dKL of the estimated models from the true models is plotted against the i\ norm of the off-diagonal entries 
of the estimated precision matrices, (e): Running time (in seconds) is plotted against the number of non-zero edges 
estimated. Note the use of a log scale on the y-axis. 



network is identical to the first in both edge identity and value, but with two components removed. 

In addition to the 500-feature network pair, we generate a pair of networks with p = 1000 features, each 
of which is block diagonal with 500 x 500 blocks corresponding to two copies of the 500-feature networks just 
described. We generate covariance matrices from the networks exactly as described in Section |7.1.1| 



7.2.2. Simulation results 

For both the p = 500 and the p = 1000 networks, we simulate 100 datasets with n — 50, n = 200, and n = 500 
samples in each class. We run FGL with Ai = 0.2, A2 = 0.1 and GGL with Ai = 0.05, A2 = 0.25. We record 
in Table [I] dKL as well as the sensitivity and false discovery rate associated with detecting non-zero edges and 
detecting differential edges. In this simulation setting, accuracy of covariance estimation (as measured by dKL) 
improves significantly from n — 50 to n = 200, and improves only marginally with a further increase to n — 500. 
Detection of edges improves throughout the range of n's sampled: for both FGL and GGL, sensitivity improves 
slightly with increased sample size, and FDR decreases dramatically. Detection of edge differences is a more 
difficult problem, for which FGL performs well. 
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Table 1. Performances as a function ofn and p. Means over 100 replicates are shown 
for dKL, and for sensitivity (Sens.) and false discovery rate (FDR) of detection of 
edges (DE) and differential edge detection (DED). 





P 


n 


dKL 


DE Sens. 


DE FDR 


DED Sens. 


DED FDR 


FGL 


500 


50 
200 
500 


545.1 
517.5 
516.6 


0.502 
0.570 
0.590 


0.966 
0.053 
0.001 


0.262 
0.228 
0.192 


0.996 
0.485 
0.036 


1000 


50 
200 
500 


1119.3 
1035.0 
1033.3 


0.600 
0.666 
0.681 


0.970 
0.063 
0.000 


0.245 
0.223 
0.194 


0.998 
0.557 
0.025 


GGL 


500 


50 
200 
500 


549.8 
520.8 
519.7 


0.490 
0.505 
0.524 


0.973 
0.060 
0.010 


0.337 
0.244 
0.194 


0.996 
0.903 
0.921 


1000 


50 
200 
500 


1127.9 
1041.7 
1039.4 


0.587 
0.615 
0.629 


0.976 
0.061 
0.007 


0.316 
0.239 
0.197 


0.998 
0.908 
0.920 



8. Analysis of lung cancer microarray data 



We applied FGL to a dataset containing 22, 283 microarray-derived gene expression measurements from large 
airway epithelial cells sampled from 97 patients with lung cancer and 90 controls ( Spira et al.||2007 |. The data 
are publicly available from the Gene Expression Omnibus ( Barrett et al.|[2005 1 at accession number GDS2771. 
We omitted genes with standard deviations in the bottom 20% since a greater share of their variance is likely 
attributable to non-biological noise. The remaining genes were normalized to have mean zero and standard 
deviation one within each class. To avoid disparate levels of sparsity between the classes and to prevent the 
larger class from dominating the estimated networks, we weighted each class equally instead of by sample 
size in (2.4). Since our goal was data visualization and hypothesis generation, we chose a high value for the 
sparsity tuning parameter, Ai — 0.95, to yield very sparse network estimates. We ran FGL with a range of 
A 2 values in order to identify the edges that differed most strongly, and settled on A2 = 0.005 as providing 
the most interpretable results. Application of Theorem [T] revealed that only 278 genes were connected to any 
other gene using the chosen tuning parameters. Identification of block diagonal structure using Theorem [l] and 
application of the FGL algorithm took less than two minutes. (Note that this data set is so large that it would 



be computationally prohibitive to apply the proposal of |Guo et al. (2011 1!) FGL estimated 134 edges shared 
between the two networks, 202 edges present only in the cancer network, and 18 edges present only in the 
normal tissue network. The results are displayed in Figure [3] 




Fig. 3. Conditional dependency networks inferred from 17,772 genes in normal and cancerous lung cells. 278 genes 
have nonzero edges in at least one of the two networks. Black lines denote edges common to both classes. Red and green 
lines denote tumor-specific and normal-specific edges, respectively. 

The estimated networks contain many two-gene subnetworks common to both classes, a few small subnet- 
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works, and one large subnetwork specific to tumor cells. Reassuringly, 45% of edges, including almost all of 
the two-gene subnetworks, connect multiple probes for the same gene. Many other edges connect genes that 
are obviously related, involved in the same biological process, or even coding for components of the same en- 
zyme. Examples include TUBA1B and TUBA1C, PABPC1 and PABPC3, HLA-B and HLA-G, and SERPINB3 
and SERPINB4. Recovery of these pairs suggests that FGL (and other network analysis tools) can generate 
high-quality hypotheses about gene co-regulation and functional interactions. This increases our confidence 
that some of the non-obvious two-gene subnetworks detected in this analysis may merit further investigation. 
Examples include DAZAP2 and TCP1, PRKAR1A and CALM3, and BCLAF1 and SERPB1. A complete list 
of subnetworks detected is available in the Supplementary Materials. 

The small black and green network in Figure [3] suggests an interesting phenomenon. It contains multiple 
probes for two hemoglobin genes, HBA2 and HBB. In the normal tissue network, the probes for these genes 
are heavily interconnected both within and between the genes. In the tumor cells, while edges between HBA2 
probes and between HBB probes are preserved, no edges connect the two genes. The abundance of connections 
between the two genes in healthy cells and the absence of connections in tumor cells may indicate a possible 
direction of future investigation. 

The most promising results of this analysis arise from the large subnetwork (104 nodes for 84 unique genes) 
unique to tumor cells. Many of the subnetwork's genes are involved in constructing ribosomes, including RPS8, 
RPS23, RPS24, RPS7pll, RPL3, RPL5, RPL10A, RPL14P1, RPL15, RPL17, RPL30 and RPL31. Other genes 
in the subnetwork further involve ribosome functioning: SRP14 and SRP9L1 are involved in recruiting proteins 
from ribosomes into the ER, and NACA inhibits the SRP pathway. Thus this subnetwork portrays a detailed 
web of relationships consistent with known biology. More interestingly, this network also contains two genes in 
the RAS oncogene family: RAB1A and RAB11A. Genes in this family have been linked to many types of cancer, 
and are considered promising targets for therapeutics (Adjei 20081. These genes' connections with ribosome 
activity in the tumor samples may indicate a relationship common to an important subset of cancers. Many 
other genes belong to this network, each indicating a potentially novel interaction in cancer biology. 



9. Discussion 



We have introduced the joint graphical lasso, a method for estimating sparse inverse covariance matrices on the 
basis of observations drawn from distinct but related classes. We employ an ADMM algorithm to solve the joint 
graphical lasso problem with any convex penalty function, and we provide explicit and efficient solutions for 
two useful penalty functions. Our algorithm is tractable on very large datasets (> 20, 000 features), and usually 
converges in seconds for smaller problems (500 features). Our joint estimation methods outperform competing 
approaches on a range of simulated datasets. 

In the JGL optimization problem ( |2.4[ ), the contribution of each class to the penalized log likelihood is 
weighted by its size; consequently, the largest class can have outsize influence on the estimated networks. By 



omitting the n/- term in (2.4), it is possible to weight the classes equally to prevent a single class from dominating 
estimation. 

We note that FGL and GGL's reliance on two tuning parameters is a strength rather than a drawback: 



unlike the proposal of Guo et al. (2011), which involves a single tuning parameter that controls both sparsity 



and similarity, in performing FGL and GGL one can vary separately the amount of similarity and sparsity to 
enforce in the network estimates. 

The joint graphical lasso has potential applications beyond those discussed in this paper. For instance, 
one could use it to shrink multiple classes' precision matrices towards each other in order to define a classifier 



intermediate between quadratic discriminant analysis (QDA) and linear discriminant analysis (LDA) (Hastie 
et al. 20091. In fact, a similar approach has been taken in recent work (Simon & Tibshirani 2012[ ). In the 
unsupervised setting, it can be used in the maximization step of Gaussian model-based clustering in order to 
reduce the variance associated with estimating a separate covariance matrix for each cluster. 

An R package implementing FGL and GGL will be made available on CRAN, 
http : / / cran.r-project . org/. 
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Appendix 1 : Modifying JGL to work on the scale of partial correlations 

A reviewer suggested that under some circumstances, it may be preferable to encourage the K networks to have 
shared partial correlations, rather than shared precision matrices. Below, we describe a simple approach for 
extending our FGL proposal to work on the scale of partial correlations. A similar approach can be taken to 
extend GGL. The extension relies on two insights. 

(a) pij = — -j==L= , where pij is the true partial correlation between the ith and jth features, and where o % i 
is the (i,j)th entry of the true precision matrix. 

(b) The algorithm for solving the FGL optimization problem can easily be modified to make use of the 
following penalty function: 

?({&}) = E E a m,iO + E E w? } - ( 9 - 22 ) 

k=l i^j k<k' i,j 
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where X t ^j = Aj/v d l% a^ , t = 1, 2, and where <j i% is an estimate of the ith diagonal element of the K precision 
matrices. (Here, we assume the K precision matrices have shared diagonal elements.) The estimate {<r"} can 
be obtained in a number of ways, for instance by performing the graphical lasso on the samples from all K 
data sets together. Then this approach will effectively result in applying a generalized fused lasso penalty to 
the partial correlations for the K classes. 



Appendix 2: Proofs of Theorems [T|and|2] 

Preliminaries to Proofs of Theorems\l\and\^ 

(k)\ . (k) 

We begin with a few comments on subgradients. The subgradient of \o\j \ with respect to v\a equals 

i if ef> > o 

-1 if Op <0 , 
a if 6»( fc) = 

for some a G [—1,1]. The subgradient of \of^ — 0^ ^| with respect to (0^,0^ ^) for k ^ k' equals (d, —d), where 



i if flW > 6p 

i ifC<<f 
a if flW = eip 



for some a G [—1, 1]. Finally, the subgradient of yEfei(^!j') 2 with respect to (Off, ■ ■ ■ , fljj^) is given by 

f (*g ) ,...,«g r) )/E£. 1 («S ) ) a ifEti(^) 2 >o 

\ ;T,, ; T K(; ; if 0g> =... = <?<*> =0 ' 

for some T 1:ij , . . . , such that J2k=i - 1- 

To prove Theorem [l] we will use the following lemma. 

Lemma 9.1. The following two sets of conditions are equivalent: 

(A) : \niSi\ < Ai + A 2 , (na^l < Ai + A 2 , and \niSi + n 2 S 2 \ < 2\i- 

(B) : There exist Fi,r2, T G [—1, 1] such that —n\S\ — X\Ti — A 2 Y = 0, and —112S2 — AiT 2 + A2T = 0. 

Proof. We will begin by proving that (B) implies (A), and will then prove that (^4) implies (B). 
Proof that (B) =$> (A): 

First of all, — n-iSi — AiTi — A 2 Y = implies that |niSi| < Ai + A 2 , since Ti,T G [—1,1]- Similarly, 
— n 2 S' 2 — AiT 2 + A 2 T = implies that 1 722 ^2 1 < Ai + A 2 . Finally, summing the two equations in (B) re- 
veals that n\S\ + n 2 S 2 = — Ai(Fi + T 2 ), which implies that |ni5i + n 2 S' 2 | < 2Ai. 

Proof that (A) (B): 

Without loss of generality, assume that n\S\ > n 2 S' 2 . We split the proof into two cases. 

(a) Case 1: n\Si — n 2 5 2 < 2A 2 . 

Let T! = T 2 = -"igi-"2S2 ; and T = - ni Sj+n 2 S 2 _ 

First, note that by (A), we know that \n1S1 +n 2 S' 2 | < 2Ai. Therefore, Ti,r 2 G [—1, 1]. Second, note that 
Case l's assumption that n±Si — n 2 5 2 < 2A 2 implies that T G [—1, 1]. Finally, we see by inspection that 
— n\S\ — \\T\ — A 2 T = 0, and — n 2 S* 2 — AiT 2 + A 2 T = 0. 
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(b) Case 2: n\Si — > 2A2. 

Let Yx = ="i|i+^a , r 2 = ~" 2 f 2 ~ A2 , and T = -1. Then, by inspection, -n^x - X l Y 1 - A 2 T = 0, and 
-n 2 S 2 - Air 2 + A 2 T - 0. 

It remains to show that ri,r2,T e [—1,1]. Trivially, T = -1 e [—1,1]. From our assumption that 
|ii<Si| < Ai + A2, we know that —1 < Y\. Moreover, by the assumptions that n\S\ — n 2 S2 > 2A 2 and 
\n1S1 + ri2S2\ < 2Ai, we have that 



Ai Ai 2Ai 



__M5 1 + A 2 ^-^i+^( 5i ^f^) 
Therefore Ti e [-1,1]. 

By the assumption that 1712521 < Ai + A2, we know that I^ = ~" 2 f^~ Aa < 1- From the assumptions that 
niSi —11282 > 2A2 and |niSi + tt-2 5*2 1 < 2Ai, we have that 



r, Q \ I ■n 1 Si—n 2 S2 \ 

-n 2 S 2 - A 2 . - Aa { 2xi ) 



-niSi - n 2 S; 



T 2 = — — > ^ — '- = > -1. (9.24) 

Ai Ai 2Ai 

Therefore T 2 € [-1,1]. 



Thus we conclude (A) =>■ (B), and our proof of Lemma 9.1 is complete. □ 

We will make use of the following lemma in order to prove Theorem [2] 
Lemma 9.2. The following two conditions are equivalent: 

(A) : There exist scalars a\, . . . ,ok such that ^DfcLi a \ — 1 an d n k\Sk\ < Ai + \2a k for all k = 1, . . . ,K. 

(B) : There exist scalars Y±, . . . , Tk € [—1, 1] and Tx, ■ ■ ■ , "Yk such that 53feLi ^-t — 1 an( ^ n kSk + AiTfe + 
A 2 T fe = for k = l,...,K. 

Proof. We will begin by proving that (B) implies (A), and will then show that (A) implies (B). 
Proof that (B) => (A) : 

By (B), n k \S k \ = \XiY k + A 2 T fc | < Ai|r fe | + X 2 \T k \ < Ai + A 2 |T fc |. Letting a k = \T k \, the result holds. 
Proof that (A) (B) : 

Let Yk and T k take the following forms, for k = 1, . . . , K : 

-1 if n k S k > Ai 

-n k S k /\i if - Ai < n k S k < Ai (9.25) 
1 if n k S k < -Ai 

(-n k Sk + Ai)/A 2 if n k S k > Ai 
T,, = { if - Ax < n k S k < Ax (9.26) 

_(-n k S k - Ai)/A 2 if n k S k < -A x 

First of all, we note by inspection that Y k e [—1, 1] and that n k S k + XiY k + A 2 Tfc = for k = 1, . . . , K, It 
remains to show that Y^ k =i ^ k — 1- Specifically, we will show that T? < a? for k = 1, . . . , K. To see why 
this is the case, note that if — Ai < n k S k < Ai then = < a\. And if n k S k > Ai or n k S k < — Ai, then 

" 2 - ( nklS * l - Xl ) 2 <a 2 k . □ 
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Proof of TheoremU] 

We first consider the claim for the case K = 2. By the Karush-Kuhn- Tucker (KKT; see e.g. Boyd & Van- 
denberghe||2004[) conditions, a necessary and sufficient set of conditions for {0} to be the solution to the JGL 



problem is that 



= ni(0 (1) ) _1 -niS (1) - Ailoi - A 2 T 
= n 2 (eW)- 1 -n2S< 2 )-A 1 r 2 + A2T, 



(9.27) 



where is the subgradient of \0$\ with respect to 9^' , T 2> ij is the subgradient of \9\^\ with respect to 



and Tij is the subgradient of — 9\^'\ with respect to 

Let C\ and C 2 be a partition of the p variables into two nonoverlapping sets, with C\ n C 2 = 0, C\ U C2 
{1, . . . ,p}. Consider the matrices 



3(2) I 



(1) 

ij ' A 

(1) 
ij 



3(2) I 



,(2) 



0(D 








(2) 







(2) 











(2) 



(9.28) 



where and ©^ solve the JGL pro blem on the features in C%, and ©3 and 2 2 ^ solve the JGL problem 
on the features in C 2 . By inspection of (9.27), ©W and &^ solve the entire JGL optimization problem if and 



only if for all i G C\, j G C2, there exist Ti ij,T2 ij, G [—1, 1] such that 



-n 2 S, 



(2) 



^1-^2, ij 



A9T 



2 J-ij 



= 
= 0. 



(9.29) 



Therefore, by Lemma 9.1 the proof of the claim for the case K = 2 is complete. 

The derivation of the necessary condition for the case K > 2 is simple and we omit it here. 



Proof of Theorem^ 

We note that Theorem [2js condition (4.191 is equivalent to the following: 

\n k s\f\ < Ai + \ 2 a ij<k for all i G C u j G C 2 , k = 1, . 
where a^i, . . . , ay _k are scalars that satisfy X}&=i a2 j & — 1- We will prove that (19.30 1 is necessary and sufficient 



,K 



(9.30) 



for the variables in C\ to be completely disconnected from those in C 2 in each of the resulting network estimates. 



By the KKT conditions, a necessary and sufficient set of conditions for {©} to be the solution to the JGL 
problem is that 

0-n fe (©W)- 1 -n fe SW-A 1 r fe -A 2 T fc (9.31) 



for k = 1,...,K. In (9.31), ^u,ij is the subgradient of \9^\ with respect to and (Ti,#, . . . , ^K,ij) is the 



subgradient of ^Y^k^i^ij') 2 with respect to (9\^ , . . . ,9, 

Let C\ and C 2 be a partition of the p variables into two nonoverlapping sets, with C\ H C 2 — 0, C\ U C 2 
{!,...,!»}. Consider the matrices of the form 



g(*0\2 



(1) 



0(fc) 



©i 



(fc) 







0, fc) 



(9.32) 



for fe = 1, . . . , K, where 0^ 



(i) 



, ©1 solve the JGL p roble m on the features in Ci, and ©3 , . . . , ©^ solve 



the JGL problem on the features in C 2 . By inspection of (9.31 1, @W, . . . , 8^ solve the entire JGL optimization 
problem if and only if for all i G C\, j G C 2 , there exist ■ ■ ■ , ^K,ij € [ — L 1] and Ti.y , . . . , ¥K,ij satisfying 
EfcLi t mj < 1 such that 

- n fc 4f } " AiL fc ,„- - A 2 T Mj = (9.33) 



Therefore, by Lemma 9.2 the proof is complete. 
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Appendix 3: Additional simulations for two-class datasets 



We first present results for a simulation study similar to the one in Section |7.1| but with only two classes. 
Taking an approach similar to the one described in Section 7.1 we defined two networks with p — 500 features 
belonging to ten equally sized unconnected subnetworks, each with a power law degree distribution. Of the 
ten subnetworks, eight have the same structure and edge values in both classes, and two are present in only 
one class. Class l's network has 490 edges, 94 of which are not present in class 2. We generated covariance 
matrices as described in Section [7.1.11 Again, we simulated 100 datasets with n = 150 observations per class. 
The results shown in Figure [4] are similar to the results in Section 7.1.2 




Fig. 4. Performance of FGL, GGL, Guo et al. ,2011)'s method, and the graphical lasso on simulated data with 150 
observations in each of 2 classes, and 500 features corresponding to ten equally sized unconnected subnetworks drawn 



from a power law distribution. Details are as given in Figure Q 



We also simulated data with an entirely different network structure. Instead of the block-diagonal network 
structure used in the previous simulations, in this simulation we generated data drawn from a single large 
power law network. We defined class l's network to be a single power law network with only one component 
and generated Si as described in Section [7.1.11 We then identified a branch in this network connected to the 
rest of the network through only one edge. We then let S^ 1 equal , except for the elements corresponding 
to the edges in the selected branch, which were set to be zero instead. Finally, we defined £2 by inverting 
S^ 1 , an d generated the two classes' data using Si and S2. This yielded distributions based on two power 
law networks that were identical except for a missing branch in class 2. Class l's network has 499 edges, 104 
of which are not present in class 2. We simulated 100 datasets with n = 150 observations per class. Figure [5] 
shows the results, averaged over the 100 data sets. Again, FGL and GGL were superior to or competitive with 
the other methods. 



Appendix 4: Network structure used in simulations 

The network structure for the simulations described in Section |7.1| is displayed in Figure [6j Black edges are 
shared between all three classes' networks, green edges are present only in classes 1 and 2, and red edges are 
present only in class 1. 
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Guo et al. ,'201 D's method, 



Fig. 5. Performance of FGL, GGL 

observations in each of 2 classes, and 500 features corresponding to a single large power law network 



and the graphical lasso on simulated data with 150 

Details are as 



given in Figure^^ 
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Fig. 6. Network used to generate simulated datasets for Figure^in Section \7.1.2\ Black edges are common to all three 
classes, green edges are present only in classes 1 and 2, and red edges are present only in class 1. 



